feat: add operational counters for sampling, requirements, and tools (#467) #883
ajbozarth wants to merge 2 commits into generative-computing:main
Conversation
…enerative-computing#467) Adds six new OpenTelemetry counters giving operators visibility into retry behaviour, validation failure rates, and tool call health: mellea.sampling.attempts/successes/failures, mellea.requirement.checks/failures, and mellea.tool.calls. Follows the established lazy-init globals + record_* helpers + Plugin hooks pattern. Extends SamplingIterationPayload and SamplingLoopEndPayload with a strategy_name field so plugins can tag counters by strategy class. Assisted-by: Claude Code Signed-off-by: Alex Bozarth <ajbozart@us.ibm.com>
The PR description has been updated. Please fill out the template for your PR to be reviewed.
PeriodicExportingMetricReader background threads (60 s default) would fire after pytest closed stdout once the suite crossed the 60 s mark, causing "I/O operation on closed file" and OTLP UNAVAILABLE errors. Assisted-by: Claude Code Signed-off-by: Alex Bozarth <ajbozart@us.ibm.com>
This is parallel work to #882, which adds the other remaining metrics telemetry; whichever merges second will need a manual merge to resolve the overlapping git diffs.

This looks good to me as an implementation of issue #467. I do wonder how useful these counters are, though. They are grouped only by strategy and requirement class, which may not give much insight without some finer-grained groupings.
akihikokuroda left a comment
LGTM. I have a general comment about the issue that this PR implements.
@jakelorocco @psschwei I actually wondered about this while implementing. I thought the idea was interesting and valuable, but my work did not create an example of how it could be used (which maybe I should add?)
Off the top of my head, it could be useful to know if a particular requirement is hard to satisfy for a particular model (for example, maybe base granite doesn't do well on quantum and we need to substitute in the granite-qiskit model). I'm not sure we necessarily need an example though (I'm a little worried about having too many examples to maintain, xref #811). Not opposed either.
psschwei left a comment
Found one issue with metric cardinality — each unique combination of attribute values creates a separate time series in your metrics backend.
Step 1: A requirement validation fails. The ValidationResult object has a .reason field containing a human-readable explanation of why it failed.
Step 2: The new RequirementMetricsPlugin (added in this PR) fires on the validation_post_check hook. It grabs reason from the result:

```python
reason = getattr(result, "reason", None) or "unknown"
record_requirement_failure(req_name, reason)
```

Step 3: record_requirement_failure (also added in this PR) passes that reason string directly as a counter attribute:
```python
_get_requirement_failures_counter().add(
    1, {"requirement": requirement, "reason": reason}
)
```

Step 4: Here's the problem. For some requirement types — specifically LLM-as-a-Judge (mellea/core/requirement.py:187, pre-existing code) — the reason is set to the full LLM output text. That's a different string nearly every time. Each unique reason string creates a new time series in the metrics backend.
So if you run 1,000 LLM-as-a-Judge validations that fail, you get ~1,000 distinct time series for mellea.requirement.failures, each with a count of 1. Metrics backends like Prometheus have cardinality limits and will start dropping data or degrading performance.
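To make the blow-up concrete, here is a tiny stand-alone simulation (plain Python, no OpenTelemetry dependency): the backend is modeled as a dict keyed by the attribute combination, mirroring how Prometheus keys time series by label values.

```python
# Stand-in metrics backend: one entry per unique attribute combination.
series: dict[tuple, int] = {}

def add(counter_name: str, attributes: dict) -> None:
    key = (counter_name, tuple(sorted(attributes.items())))
    series[key] = series.get(key, 0) + 1

# 1,000 failed LLM-as-a-Judge validations, each with a unique free-form reason:
for i in range(1000):
    add("mellea.requirement.failures",
        {"requirement": "check_style", "reason": f"The output fails because... ({i})"})

print(len(series))  # 1,000 distinct time series, each with a count of 1
```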
What you'd want instead: Either drop reason from the metric attributes (put it in traces/logs instead), or map it to a small bounded set of categories like "llm_judge", "execution_error", "timeout", etc.
Type of PR: Misc

Description
Adds six new OpenTelemetry counters as part of the telemetry epic (#443),
giving operators visibility into retry behaviour, validation failure rates,
and tool call health alongside the existing LLM-level metrics:
- mellea.sampling.attempts / mellea.sampling.successes / mellea.sampling.failures — tagged by strategy class name
- mellea.requirement.checks / mellea.requirement.failures — tagged by requirement class name and failure reason
- mellea.tool.calls — tagged by tool name and status ("success"/"failure")

Follows the established plugin pattern: lazy-init counter globals + record_* helpers in metrics.py, consumed by three new Plugin classes in metrics_plugins.py (SamplingMetricsPlugin, RequirementMetricsPlugin, ToolMetricsPlugin) that hook into existing hook events in FIRE_AND_FORGET mode. No metric calls in business-logic files.
Also extends SamplingIterationPayload and SamplingLoopEndPayload with a strategy_name field (backward-compatible, defaults to "") so plugins can tag counters with the strategy class name.
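The backward-compatible field addition can be pictured as a dataclass field with a default; the class name comes from the PR, but the other fields shown are assumed for illustration:

```python
from dataclasses import dataclass

@dataclass
class SamplingIterationPayload:
    # Existing fields elided; only the shape of the change is shown.
    iteration: int = 0
    strategy_name: str = ""  # new field: old call sites omit it and still work

p = SamplingIterationPayload(iteration=3)
print(p.strategy_name)  # "" when callers predate the new field
```

Because the new field has a default, existing constructors and serialized payloads need no changes.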
Testing
Attribution